System freeze with general protection fault in spl

2023-03-15 00:17

The first stack trace is really suspicious, since it suggests that the GPF occurred very early in the function, where the pointer from dbuf_kmem_cache was likely dereferenced. The code only writes to dbuf_kmem_cache once at module load, and the value should then remain the same until the module is unloaded, so a corrupt pointer there would be consistent with memory corruption rather than a logic bug.

The second and third stack traces occur in places that really should not have problems. The third stack trace is also weird, because it appears to be a NULL pointer dereference with a double bit flip in the pointer address.

Kernel memory addresses should start with ffff, but the first trace has pointers that start with ff77, which could be the result of a double bit flip. If that is what happened, the damage might not be limited to those bits, but it is hard to tell. In the second trace, ffeb could also be the result of a double bit flip. The third stack trace looks like a NULL pointer dereference, but I am not sure why we would have a NULL pointer dereference there. One possibility is that the wrong pointer was read when a pointer to it was dereferenced, but I have no proof of that and would need to study the disassembly to know whether that is even possible.
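The bit-flip arithmetic can be checked by XORing each observed prefix against the expected ffff and counting the set bits. This is only a minimal sketch: the helper name `flipped_bits` is made up for illustration, and it looks at the top 16 bits of the address only, so it says nothing about the rest of the pointer.

```python
def flipped_bits(observed: int, expected: int = 0xFFFF) -> list[int]:
    """Return the bit positions (0 = least significant) where the two values differ."""
    diff = observed ^ expected
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# Prefixes quoted from the first and second stack traces.
for prefix in (0xFF77, 0xFFEB):
    bits = flipped_bits(prefix)
    print(f"{prefix:04x}: differs from ffff in {len(bits)} bit(s) at positions {bits}")
```

Both prefixes differ from ffff in exactly two bits (positions 3 and 7 for ff77, positions 2 and 4 for ffeb). Add 48 to get the position within the full 64-bit pointer.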

Unfortunately, getting the disassembly of the kernel itself is somewhat involved and you would probably need to send me your kernel binary somehow for me to disassemble it manually. However, it is possible to disassemble the ZFS kernel modules fairly easily to get some information. Would you provide the output of the following commands?

objdump -d $(modinfo -n spl) | grep -A 150 ':'
objdump -d $(modinfo -n zfs) | grep -A 150 ':'

I would like to see the instruction that caused the GPF in spl_kmem_cache_alloc to confirm that the pointer is from dbuf_kmem_cache. The only reliable way of doing that is to get the disassembly of your binary, since the binaries are always subtly different between systems and I do not have the time to try to reproduce your system. I would also like to see where in zio_add_child the list and mutex functions were called, although that is unlikely to be useful beyond identifying a line number, since the GPF occurred inside Linux kernel functions that were called from zio_add_child.
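The grep patterns in the commands above appear to have lost the symbol name somewhere along the way. As a rough sketch, assuming the two functions of interest are the ones named in this paragraph (spl_kmem_cache_alloc and zio_add_child), the listing for a single function can be cut out of `objdump -d` output as below. The helper names `extract_function` and `disassemble_module` are hypothetical, and the second one assumes that `modinfo -n` returns a path to an uncompressed module that objdump can read directly.

```python
import re
import subprocess


def extract_function(disassembly: str, symbol: str) -> list[str]:
    """Return the lines of one function from `objdump -d` text output.

    objdump starts each function with a header such as
    `0000000000001230 <symbol>:` and separates functions with a blank line.
    """
    header = re.compile(r"^[0-9a-fA-F]+ <" + re.escape(symbol) + r">:$")
    lines: list[str] = []
    capturing = False
    for line in disassembly.splitlines():
        if capturing:
            if not line.strip():  # a blank line ends the function listing
                break
            lines.append(line)
        elif header.match(line.strip()):
            capturing = True
            lines.append(line)
    return lines


def disassemble_module(module: str) -> str:
    """Run `objdump -d` on the file that `modinfo -n` reports for a module."""
    path = subprocess.run(
        ["modinfo", "-n", module], capture_output=True, text=True, check=True
    ).stdout.strip()
    return subprocess.run(
        ["objdump", "-d", path], capture_output=True, text=True, check=True
    ).stdout
```

For example, `extract_function(disassemble_module("spl"), "spl_kmem_cache_alloc")` would return the header line plus every instruction of that function, without having to guess how many lines to ask grep for.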

That said, this looks more consistent with memory corruption than with a bug in the code base, and looking at the disassembly would mainly be a way to rule out other possibilities. Bad memory is not the only thing that could cause memory corruption, as wild writes from bugs elsewhere in the kernel or from PCI-E devices (when an IOMMU is not in use) could cause it, but bad memory would be the typical candidate. I have heard of memtest86+ taking 72 hours to detect bad memory. :/

Another possibility is a CPU erratum. I strongly suggest making certain that your CPU microcode is up to date as a preventative measure:

https://wiki.debian.org/Microcode
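To see which microcode revision is currently loaded, the `microcode` field in /proc/cpuinfo can be read on x86 Linux. This is a small sketch with made-up helper names (`microcode_revisions`, `read_current_revisions`). It only reports what is loaded now; whether that is the latest revision still has to be checked against what your distribution or CPU vendor ships.

```python
def microcode_revisions(cpuinfo_text: str) -> set[str]:
    """Collect the distinct `microcode` values from /proc/cpuinfo-style text."""
    revisions: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


def read_current_revisions(path: str = "/proc/cpuinfo") -> set[str]:
    """Read the revisions reported by the running kernel (x86 Linux assumed)."""
    with open(path, encoding="utf-8") as handle:
        return microcode_revisions(handle.read())
```

Calling `read_current_revisions()` before and after installing the microcode package (and rebooting) shows whether the loaded revision actually changed. More than one value in the set would mean the cores are not all running the same revision.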
